GEOG 5160/6160 Lab 10

Author
Affiliation

Simon Brewer

University of Utah

Published

October 28, 2025

Introduction

This notebook will walk through setting up a model for the semantic segmentation of images using convolutional neural networks (CNNs). Unlike classification, where we attempt to predict a label associated with an image (e.g. cat or dog), in semantic segmentation, we are trying to label each pixel within an image. This is usually done by providing a corresponding mask for each training image that indicates which pixels belong to which class. The example used here is based on a set of aerial images taken across Dubai and used in a Kaggle competition:

https://www.kaggle.com/datasets/humansintheloop/semantic-segmentation-of-aerial-imagery

There are a total of 72 images and masks in this dataset. In the interest of making this tractable in a class, we’ll just train the model using a subset (18) of these images, and only for a few epochs. With a relatively small dataset, the goal of this lab is to demonstrate how to build and evaluate these models. I would not expect to get a very high level of accuracy without increasing both the size of the dataset and the number of epochs.

Code for the UNet model in this example has been modified from https://github.com/r-tensorflow/unet/tree/master

Objectives

  • Build a simple segmentation model in TF/Keras
  • Understand how to build an encoder and decoder branch in a convolutional neural network
  • Use skip connections to preserve spatial structure

Data processing

First, let’s load some libraries

library(fs)
library(tensorflow)
library(keras3)

Attaching package: 'keras3'
The following objects are masked from 'package:tensorflow':

    set_random_seed, shape

Next, we’ll get the images. These are available through the class Google drive in the zip file unet_images3.zip. Download this now, and move it to a folder that is easy to find on your computer, and unzip it. This will create a set of folders that look like this:

- images3
    - images
    - masks

In each of these you’ll find matching images. The images folder contains the RGB image as JPEGs, and the masks folder contains the matching mask as PNG files. The file names should match, so image_part_001_000.png will be the mask for image_part_001_000.jpg. These files are smaller tiles created from the original images. If you want to see what the original images look like, download and unzip the file unet_images2.zip. If you have this, you can load an example of each. First, we’ll make a couple of functions to display images using keras functions:

display_image_tensor <- function(x, ..., max = 255,
                                 plot_margins = c(0, 0, 0, 0)) {   
  if(!is.null(plot_margins))
    par(mar = plot_margins)
  x |>
    as.array() |>
    drop() |>
    as.raster(max = max) |>
    plot(..., interpolate = FALSE)
}

display_target_tensor <- function(target) {
  display_image_tensor(target, max = 5)   
}

Now get the list of full images:

data_dir <- path("./datafiles/images2/")

input_dir <- data_dir / "images/"
target_dir <- data_dir / "masks/"

image_paths <- tibble::tibble(
  input = sort(dir_ls(input_dir, glob = "*.jpg")),
  target = sort(dir_ls(target_dir, glob = "*.png")))

And here’s the first image:

image_paths$input[1] |>
  tf$io$read_file() |>
  tf$io$decode_jpeg() |>
  display_image_tensor()

And the corresponding mask:

image_paths$target[1] |>
  tf$io$read_file() |>
  tf$io$decode_png() |>
  display_image_tensor()

Now let’s take a look at the tiles in images3/. We’ll make a list of the full paths to both images and masks for use in training the model

data_dir <- path("./datafiles/images3/")
dir_create(data_dir)

input_dir <- data_dir / "images/"
target_dir <- data_dir / "masks/"

image_paths <- tibble::tibble(
  input = sort(dir_ls(input_dir, glob = "*.jpg")),
  target = sort(dir_ls(target_dir, glob = "*.png")))

image_paths
# A tibble: 2,016 × 2
   input                                             target                     
   <fs::path>                                        <fs::path>                 
 1 ./datafiles/images3/images/image_part_001_000.jpg …sks/image_part_001_000.png
 2 ./datafiles/images3/images/image_part_001_001.jpg …sks/image_part_001_001.png
 3 ./datafiles/images3/images/image_part_001_002.jpg …sks/image_part_001_002.png
 4 ./datafiles/images3/images/image_part_001_003.jpg …sks/image_part_001_003.png
 5 ./datafiles/images3/images/image_part_001_004.jpg …sks/image_part_001_004.png
 6 ./datafiles/images3/images/image_part_001_005.jpg …sks/image_part_001_005.png
 7 ./datafiles/images3/images/image_part_001_006.jpg …sks/image_part_001_006.png
 8 ./datafiles/images3/images/image_part_001_007.jpg …sks/image_part_001_007.png
 9 ./datafiles/images3/images/image_part_001_008.jpg …sks/image_part_001_008.png
10 ./datafiles/images3/images/image_part_001_009.jpg …sks/image_part_001_009.png
# ℹ 2,006 more rows

If we plot the first image, you should see that it is the top-left corner of the original image

image_paths$input[1] |>
  tf$io$read_file() |>
  tf$io$decode_jpeg() |>
  display_image_tensor()

We’ll load the matching mask as well. Note that this has been converted to an integer mask, with 6 possible classes:

Building = 0
Land = 1
Road = 2
Vegetation = 3
Water = 4
Unlabeled = 5

image_paths$target[1] |>
  tf$io$read_file() |>
  tf$io$decode_png() |>
  display_target_tensor()

Next, we’ll create two TensorFlow datasets that hold the images. As this is a fairly small dataset, we’ll simply read the images into memory. For larger sets, we would need to create a data generator here. We’ll first make a few helper functions:

  • A function to read images
  • A function to resize images
  • A function to gather the images into a dataset

library(tfdatasets)

Attaching package: 'tfdatasets'
The following object is masked from 'package:keras3':

    shape

tf_read_image <-
  function(path, format = "image", resize = NULL, ...) {
    
    img <- path |>
      tf$io$read_file() |>
      tf$io[[paste0("decode_", format)]](...)
    
    if (!is.null(resize))
      img <- img |>
        tf$image$resize(as.integer(resize))
    img
  }

tf_read_image_and_resize <- function(..., resize = img_size) {
  tf_read_image(..., resize = resize)
}

make_dataset <- function(paths_df) {
  tensor_slices_dataset(paths_df) |>
    dataset_map(function(path) {
      image <- path$input |>
        tf_read_image_and_resize("jpeg", channels = 3L) ## Reads images (3 channels)
      target <- path$target |>
        tf_read_image_and_resize("png", channels = 1L) ## Reads masks (1 channel)
      # target <- target - 1
      list(image, target) ## Stores image and corresponding mask
    }) |>
    dataset_cache() |> ## Dynamically caches the images
    dataset_shuffle(buffer_size = nrow(paths_df)) |> ## Shuffles images between runs
    dataset_batch(32)
}

Now let’s create the dataset. First, we’ll define the input image size - for this we’ll keep the images at their original size (128x128), but resizing can be used when tiles differ in size, to ensure all input tensors have the same shape. Second, we define the number of images to be used for validation (roughly 25% of the inputs). Third, we split the list of file names into training and validation sets. And finally, we make the two datasets

img_size <- c(128, 128)

num_val_samples <- 500
val_idx <- sample.int(nrow(image_paths), num_val_samples)

val_paths <- image_paths[val_idx, ]
train_paths <- image_paths[-val_idx, ]

validation_dataset <- make_dataset(val_paths)
train_dataset <- make_dataset(train_paths)
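As an optional sanity check, we can pull a single batch from the training dataset and confirm the tensor shapes. This is a sketch assuming reticulate-style iteration over a TensorFlow dataset works in your setup; the dimensions shown in the comments follow from img_size and the batch size set in make_dataset():

```r
## Optional sanity check: pull one batch and inspect its shapes
batch <- train_dataset |>
  dataset_take(1) |>
  reticulate::as_iterator() |>
  reticulate::iter_next()

dim(batch[[1]])  # images: batch x 128 x 128 x 3
dim(batch[[2]])  # masks:  batch x 128 x 128 x 1
```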

We’ll finish this section by defining a set of variables describing the images: the width and height, the number of channels and classes

image_width = img_size[1]
image_height = img_size[2]
num_channels = 3
num_classes = 6

UNet Model

Now let’s turn to building the model. We’ll use a basic UNet architecture for this. This has two sequential branches (an encoder and a decoder) as well as a number of skip connections. The encoder branch operates like a classic CNN, with convolution and pooling layers. The decoder reverses this, using upsampling layers to increase the resolution, followed by more convolutions. Practically, each branch has a series of steps that either decrease the resolution (encoder) or increase it (decoder). The steps on each side match: for example, if the encoder has a step going from a resolution of 64 to 32, the decoder has a matching step going from 32 to 64.

We’ll need to use some new layer types for this, so we’ll take a look at these first

Upsampling

Upsampling layers act as the opposite of a max-pooling layer. Pooling reduces the size of the inputs by replacing a window of pixels (usually 2 by 2) with a single pixel containing the maximum value of the original four. An upsampling layer increases the resolution of the input according to a defined window (usually 2x2, meaning each original pixel is split into 4). There are two types of upsampling layers:

UpSampling2D

This simply increases the resolution of the input. So an input pixel with the value of 2 will be split into 4, each with the value of 2:

In:  [2]
Out: [[2, 2],
      [2, 2]]
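This nearest-neighbour expansion can be mimicked in base R with a Kronecker product, where each pixel value is repeated across a 2x2 output window (a toy illustration, not part of the model code):

```r
## Toy illustration of what UpSampling2D does to a 2x2 'image':
## each pixel value is repeated across a 2x2 output window
m <- matrix(c(2, 4,
              6, 8), nrow = 2, byrow = TRUE)
up <- kronecker(m, matrix(1, 2, 2))
up
##      [,1] [,2] [,3] [,4]
## [1,]    2    2    4    4
## [2,]    2    2    4    4
## [3,]    6    6    8    8
## [4,]    6    6    8    8
```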

Conv2DTranspose

In addition to the upsampling, this layer applies convolutional filters. As a result, the values of the 4 output pixels are based on feature recognition in the coarser image, rather than simply repeating the same value.
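A standalone sketch (assuming keras3 is loaded) showing that a transposed convolution with 2x2 strides doubles the spatial resolution while learning its filter weights; the 8x8x64 input shape here is arbitrary:

```r
## Sketch: a Conv2DTranspose with 2x2 strides doubles the resolution,
## here from 8x8 (64 channels) to 16x16 (32 channels)
library(keras3)
inp <- layer_input(shape = c(8, 8, 64))
out <- layer_conv_2d_transpose(inp, filters = 32, kernel_size = c(2, 2),
                               strides = c(2, 2), padding = "same")
keras_model(inp, out)  # output shape: (None, 16, 16, 32)
```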

Skip connection

Skip connections are used to join the encoder and decoder branches. These join the matching encoder and decoder steps (e.g. the downsampling from 64 to 32 and the upsampling from 32 to 64). This is done using concatenate layers, which link together the output from different layers. For example, if you wanted to introduce two different sets of input features through different networks, a concatenate layer would merge these together before linking to the output.

To understand how this works for the UNet model, let’s say our input images are 128x128 pixels:

  • Step 1: The input is passed through a series of convolutions, and the output is set of transformed values at the same resolution (128x128)
  • Step 2: The output of step 1 is passed through a max-pooling step which reduces the resolution to 64x64
  • Step 3: The output of step 2 is passed through more convolutions (output size 64x64)
  • Step 4: The output of step 3 is upsampled back to 128x128
  • Step 5: The output of step 4 is concatenated with the output of step 1

In practice this is more complex as these skip connections are taking place at every down/up-sampling step.
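Steps 1 to 5 above can be sketched with the functional API as a single down/up pair. This is a minimal illustration, assuming keras3 is available; the full model below uses more filters, dropout, and four such blocks:

```r
## One down/up step with a skip connection (Steps 1-5 above)
library(keras3)
inp <- layer_input(shape = c(128, 128, 3))
s1 <- layer_conv_2d(inp, 16, c(3, 3), padding = "same", activation = "relu") # Step 1: 128x128
s2 <- layer_max_pooling_2d(s1, pool_size = c(2, 2))                          # Step 2: 64x64
s3 <- layer_conv_2d(s2, 32, c(3, 3), padding = "same", activation = "relu")  # Step 3: 64x64
s4 <- layer_upsampling_2d(s3, size = c(2, 2))                                # Step 4: back to 128x128
s5 <- layer_concatenate(list(s1, s4))                                        # Step 5: 128x128, 16 + 32 channels
```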

Model architecture

Let’s actually build the model now so that you can see what this looks like. We’ll use the functional API which will allow us to build this in sections. One thing to note here is that (after the input), we store the layers in an object called x, then add the next layer to this so that it accumulates these:

  • Create a blank list to store layers (this will be used later to link the downward and upward path)
## To store the blocks for the downward pass
down_layers <- list()
  • Create the input layer, using the image size definitions. Link this to a rescaling layer (the input images are RGB with values from 0-255)
## Input
input <- layer_input(shape = c(image_width, image_height, num_channels))
x <- layer_rescaling(input, 1/255)
  • First downsampling block. This is the first of four downsampling blocks that make up the encoder. These will have the same format, but the number of convolutional filters will double at each block:

  • A first convolutional layer

  • A dropout layer

  • A second convolutional layer

  • (Store the block)

  • A max-pooling layer

## ------------
# Encoder path: forward step 1
x <- layer_conv_2d(x, filters = 16, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 16, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
## Store block
down_layers[[1]] <- x
## Max-pooling
x <- layer_max_pooling_2d(x, pool_size = c(2,2), strides = c(2,2))
  • The second downward block. Note that we increase the number of filters from 16 to 32:
## ------------
# Encoder path: forward step 2
x <- layer_conv_2d(x, filters = 32, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 32, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
## Store block
down_layers[[2]] <- x
## Max-pooling
x <- layer_max_pooling_2d(x, pool_size = c(2,2), strides = c(2,2))
  • The third downward block. Note that we increase the number of filters from 32 to 64:
## ------------
# Encoder path: forward step 3
x <- layer_conv_2d(x, filters = 64, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 64, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
## Store block
down_layers[[3]] <- x
## Max-pooling
x <- layer_max_pooling_2d(x, pool_size = c(2,2), strides = c(2,2))
  • The fourth downward block. Note that we increase the number of filters from 64 to 128:
## ------------
# Encoder path: forward step 4 
x <- layer_conv_2d(x, filters = 128, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 128, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
## Store block
down_layers[[4]] <- x
## Max-pooling
x <- layer_max_pooling_2d(x, pool_size = c(2,2), strides = c(2,2))
  • Now we make the latent space block. This acts to connect the downward and upward path. This is the most abstract part of the model as it contains the fully filtered and pooled inputs. We pass this through more convolutional filters and another dropout
## ------------
# Latent space
## Add another dropout
x <- layer_dropout(x, rate = 0.1)
## Convolutional layer on latent space
x <- layer_conv_2d(x, filters = 256, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_conv_2d(x, filters = 256, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
  • Now we can start the upsampling path (the decoder). There will again be four of these, to match the downsampling path (so we’ll call the first one number four). Note that here we start with a high number of filters (128) and decrease by 50% for each new block. Each block will have the same format:
    • A Conv2DTranspose layer to upsample the inputs, increasing the resolution
    • A concatenate layer that links this to the corresponding downsampling block (this will be the fourth one)
    • A first convolutional layer
    • A dropout layer
    • A second convolutional layer
## ------------
# Decoder path 4
x <- layer_conv_2d_transpose(x, filters = 128, kernel_size = c(2,2),
  padding = "same", strides = c(2,2))
x <- layer_concatenate(list(down_layers[[4]], x))
x <- layer_conv_2d(x, filters = 128, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 128, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
  • The third upward block. Note that we decrease the number of filters from 128 to 64:
## ------------
# Decoder path 3
x <- layer_conv_2d_transpose(x, filters = 64, kernel_size = c(2,2),
                                    padding = "same", strides = c(2,2))
x <- layer_concatenate(list(down_layers[[3]], x))
x <- layer_conv_2d(x, filters = 64, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 64, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
  • The second upward block. Note that we decrease the number of filters from 64 to 32:
## ------------
# Decoder path 2
x <- layer_conv_2d_transpose(x, filters = 32, kernel_size = c(2,2),
                                    padding = "same", strides = c(2,2))
x <- layer_concatenate(list(down_layers[[2]], x))
x <- layer_conv_2d(x, filters = 32, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 32, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
  • The first upward block. Note that we decrease the number of filters from 32 to 16:
## ------------
# Decoder path 1
x <- layer_conv_2d_transpose(x, filters = 16, kernel_size = c(2,2),
                                    padding = "same", strides = c(2,2))
x <- layer_concatenate(list(down_layers[[1]], x))
x <- layer_conv_2d(x, filters = 16, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
x <- layer_dropout(x, rate = 0.1)
x <- layer_conv_2d(x, filters = 16, kernel_size = c(3,3),
                   activation = "relu", kernel_initializer = "he_normal", padding = "same")
  • We now make the final layer, the output layer. This is a slightly unusual layer: a convolutional layer, but with a 1x1 window size. This acts a little like the flatten layer we have previously used, but here forces the output into a shape that is compatible with the masks. The output has 6 channels (one for each class), giving a per-pixel probability for each class.
## ------------
# Output layer
output <- layer_conv_2d(x, filters = num_classes,
                               kernel_size = c(1,1), activation = "softmax")

With all that done, we can now make the model by linking the input layers and the output:

model <- keras_model(input, output)

Let’s take a look at the model summary:

summary(model)
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)          ┃ Output Shape      ┃     Param # ┃ Connected to       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ input_layer           │ (None, 128, 128,  │           0 │ -                  │
│ (InputLayer)          │ 3)                │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ rescaling (Rescaling) │ (None, 128, 128,  │           0 │ input_layer[0][0]  │
│                       │ 3)                │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d (Conv2D)       │ (None, 128, 128,  │         448 │ rescaling[0][0]    │
│                       │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout (Dropout)     │ (None, 128, 128,  │           0 │ conv2d[0][0]       │
│                       │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_1 (Conv2D)     │ (None, 128, 128,  │       2,320 │ dropout[0][0]      │
│                       │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ max_pooling2d         │ (None, 64, 64,    │           0 │ conv2d_1[0][0]     │
│ (MaxPooling2D)        │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_2 (Conv2D)     │ (None, 64, 64,    │       4,640 │ max_pooling2d[0][… │
│                       │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_1 (Dropout)   │ (None, 64, 64,    │           0 │ conv2d_2[0][0]     │
│                       │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_3 (Conv2D)     │ (None, 64, 64,    │       9,248 │ dropout_1[0][0]    │
│                       │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ max_pooling2d_1       │ (None, 32, 32,    │           0 │ conv2d_3[0][0]     │
│ (MaxPooling2D)        │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_4 (Conv2D)     │ (None, 32, 32,    │      18,496 │ max_pooling2d_1[0… │
│                       │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_2 (Dropout)   │ (None, 32, 32,    │           0 │ conv2d_4[0][0]     │
│                       │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_5 (Conv2D)     │ (None, 32, 32,    │      36,928 │ dropout_2[0][0]    │
│                       │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ max_pooling2d_2       │ (None, 16, 16,    │           0 │ conv2d_5[0][0]     │
│ (MaxPooling2D)        │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_6 (Conv2D)     │ (None, 16, 16,    │      73,856 │ max_pooling2d_2[0… │
│                       │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_3 (Dropout)   │ (None, 16, 16,    │           0 │ conv2d_6[0][0]     │
│                       │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_7 (Conv2D)     │ (None, 16, 16,    │     147,584 │ dropout_3[0][0]    │
│                       │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ max_pooling2d_3       │ (None, 8, 8, 128) │           0 │ conv2d_7[0][0]     │
│ (MaxPooling2D)        │                   │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_4 (Dropout)   │ (None, 8, 8, 128) │           0 │ max_pooling2d_3[0… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_8 (Conv2D)     │ (None, 8, 8, 256) │     295,168 │ dropout_4[0][0]    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_9 (Conv2D)     │ (None, 8, 8, 256) │     590,080 │ conv2d_8[0][0]     │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_transpose      │ (None, 16, 16,    │     131,200 │ conv2d_9[0][0]     │
│ (Conv2DTranspose)     │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ concatenate           │ (None, 16, 16,    │           0 │ conv2d_7[0][0],    │
│ (Concatenate)         │ 256)              │             │ conv2d_transpose[… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_10 (Conv2D)    │ (None, 16, 16,    │     295,040 │ concatenate[0][0]  │
│                       │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_5 (Dropout)   │ (None, 16, 16,    │           0 │ conv2d_10[0][0]    │
│                       │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_11 (Conv2D)    │ (None, 16, 16,    │     147,584 │ dropout_5[0][0]    │
│                       │ 128)              │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_transpose_1    │ (None, 32, 32,    │      32,832 │ conv2d_11[0][0]    │
│ (Conv2DTranspose)     │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ concatenate_1         │ (None, 32, 32,    │           0 │ conv2d_5[0][0],    │
│ (Concatenate)         │ 128)              │             │ conv2d_transpose_… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_12 (Conv2D)    │ (None, 32, 32,    │      73,792 │ concatenate_1[0][… │
│                       │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_6 (Dropout)   │ (None, 32, 32,    │           0 │ conv2d_12[0][0]    │
│                       │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_13 (Conv2D)    │ (None, 32, 32,    │      36,928 │ dropout_6[0][0]    │
│                       │ 64)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_transpose_2    │ (None, 64, 64,    │       8,224 │ conv2d_13[0][0]    │
│ (Conv2DTranspose)     │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ concatenate_2         │ (None, 64, 64,    │           0 │ conv2d_3[0][0],    │
│ (Concatenate)         │ 64)               │             │ conv2d_transpose_… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_14 (Conv2D)    │ (None, 64, 64,    │      18,464 │ concatenate_2[0][… │
│                       │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_7 (Dropout)   │ (None, 64, 64,    │           0 │ conv2d_14[0][0]    │
│                       │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_15 (Conv2D)    │ (None, 64, 64,    │       9,248 │ dropout_7[0][0]    │
│                       │ 32)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_transpose_3    │ (None, 128, 128,  │       2,064 │ conv2d_15[0][0]    │
│ (Conv2DTranspose)     │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ concatenate_3         │ (None, 128, 128,  │           0 │ conv2d_1[0][0],    │
│ (Concatenate)         │ 32)               │             │ conv2d_transpose_… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_16 (Conv2D)    │ (None, 128, 128,  │       4,624 │ concatenate_3[0][… │
│                       │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ dropout_8 (Dropout)   │ (None, 128, 128,  │           0 │ conv2d_16[0][0]    │
│                       │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_17 (Conv2D)    │ (None, 128, 128,  │       2,320 │ dropout_8[0][0]    │
│                       │ 16)               │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ conv2d_18 (Conv2D)    │ (None, 128, 128,  │         102 │ conv2d_17[0][0]    │
│                       │ 6)                │             │                    │
└───────────────────────┴───────────────────┴─────────────┴────────────────────┘
 Total params: 1,941,190 (7.41 MB)
 Trainable params: 1,941,190 (7.41 MB)
 Non-trainable params: 0 (0.00 B)

This model has 1.94 million weights or parameters to train. This is fairly common with any large CNN-type model, and is why we generally need a large amount of data to train.

We can also visualize the architecture. You should be able to see a ‘C’-like structure between the downward and upward paths of the model. In the original paper, this was shown rotated 90 degrees to the left, hence the name UNet. (Note that you might need to save this and zoom in to see the detail.)

plot(model)

Performance metrics

We’ll use overall pixel accuracy to assess the model (alternatively, we could use the intersection over union).

metrics = "accuracy"
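For reference, the per-class intersection over union can be computed directly from two integer masks. Here is a minimal base-R sketch using small hypothetical masks (pred and truth are made up for illustration, not part of the lab data):

```r
## Per-class intersection over union (IoU) between two integer masks
iou <- function(pred, truth, class_id) {
  p <- pred == class_id
  t <- truth == class_id
  u <- sum(p | t)
  if (u == 0) return(NA_real_)  # class absent from both masks
  sum(p & t) / u
}

pred  <- matrix(c(0, 0, 1, 1), nrow = 2)  # hypothetical predicted mask
truth <- matrix(c(0, 1, 1, 1), nrow = 2)  # hypothetical true mask
iou(pred, truth, class_id = 1)  # 2 overlapping of 3 labelled pixels = 2/3
```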

Optimizer

We’ll set the optimizer to RMSprop with a learning rate of 1e-3:

optim = optimizer_rmsprop(learning_rate = 1e-3)

Training

Let’s compile the model and create a callback to save the best performing set of weights during training

model |>
  compile(optimizer = optim,
          loss = "sparse_categorical_crossentropy",
          metrics = metrics)   
callbacks <- list(
  callback_model_checkpoint("lulc_segmentation.keras",
                            save_best_only = TRUE))   

With all that in place, we can train the model. We’ll use the batches of 32 images set in make_dataset(), and run for 25 epochs.

history <- model |> fit(
  train_dataset,
  epochs = 25,
  callbacks = callbacks,
  validation_data = validation_dataset
)
Epoch 1/25
48/48 - 27s - 558ms/step - accuracy: 0.4014 - loss: 1.3712 - val_accuracy: 0.5243 - val_loss: 1.1535
Epoch 2/25
48/48 - 25s - 525ms/step - accuracy: 0.5256 - loss: 1.1602 - val_accuracy: 0.6087 - val_loss: 1.1859
Epoch 3/25
48/48 - 26s - 541ms/step - accuracy: 0.6187 - loss: 1.0193 - val_accuracy: 0.6388 - val_loss: 0.9687
Epoch 4/25
48/48 - 28s - 577ms/step - accuracy: 0.6486 - loss: 0.9410 - val_accuracy: 0.6860 - val_loss: 0.8606
Epoch 5/25
48/48 - 27s - 571ms/step - accuracy: 0.6658 - loss: 0.9032 - val_accuracy: 0.6963 - val_loss: 0.8223
Epoch 6/25
48/48 - 26s - 549ms/step - accuracy: 0.6880 - loss: 0.8588 - val_accuracy: 0.7074 - val_loss: 0.8047
Epoch 7/25
48/48 - 26s - 546ms/step - accuracy: 0.7137 - loss: 0.8118 - val_accuracy: 0.6007 - val_loss: 1.1324
Epoch 8/25
48/48 - 26s - 551ms/step - accuracy: 0.7244 - loss: 0.7872 - val_accuracy: 0.6884 - val_loss: 0.8895
Epoch 9/25
48/48 - 26s - 545ms/step - accuracy: 0.7360 - loss: 0.7586 - val_accuracy: 0.6670 - val_loss: 0.9045
Epoch 10/25
48/48 - 26s - 537ms/step - accuracy: 0.7503 - loss: 0.7196 - val_accuracy: 0.6533 - val_loss: 0.9758
Epoch 11/25
48/48 - 26s - 546ms/step - accuracy: 0.7603 - loss: 0.6939 - val_accuracy: 0.7539 - val_loss: 0.7051
Epoch 12/25
48/48 - 26s - 551ms/step - accuracy: 0.7661 - loss: 0.6694 - val_accuracy: 0.6518 - val_loss: 1.0766
Epoch 13/25
48/48 - 26s - 543ms/step - accuracy: 0.7680 - loss: 0.6641 - val_accuracy: 0.7783 - val_loss: 0.6228
Epoch 14/25
48/48 - 27s - 559ms/step - accuracy: 0.7753 - loss: 0.6393 - val_accuracy: 0.7483 - val_loss: 0.7113
Epoch 15/25
48/48 - 27s - 552ms/step - accuracy: 0.7780 - loss: 0.6283 - val_accuracy: 0.7812 - val_loss: 0.6273
Epoch 16/25
48/48 - 27s - 556ms/step - accuracy: 0.7819 - loss: 0.6222 - val_accuracy: 0.7850 - val_loss: 0.6125
Epoch 17/25
48/48 - 27s - 566ms/step - accuracy: 0.7847 - loss: 0.6080 - val_accuracy: 0.7556 - val_loss: 0.6768
Epoch 18/25
48/48 - 27s - 558ms/step - accuracy: 0.7869 - loss: 0.6016 - val_accuracy: 0.7657 - val_loss: 0.6894
Epoch 19/25
48/48 - 27s - 553ms/step - accuracy: 0.7885 - loss: 0.5976 - val_accuracy: 0.7865 - val_loss: 0.6320
Epoch 20/25
48/48 - 26s - 550ms/step - accuracy: 0.7898 - loss: 0.5871 - val_accuracy: 0.7630 - val_loss: 0.6488
Epoch 21/25
48/48 - 26s - 545ms/step - accuracy: 0.7955 - loss: 0.5752 - val_accuracy: 0.7914 - val_loss: 0.5822
Epoch 22/25
48/48 - 26s - 544ms/step - accuracy: 0.7973 - loss: 0.5687 - val_accuracy: 0.7904 - val_loss: 0.5740
Epoch 23/25
48/48 - 26s - 541ms/step - accuracy: 0.8012 - loss: 0.5580 - val_accuracy: 0.8032 - val_loss: 0.5649
Epoch 24/25
48/48 - 26s - 540ms/step - accuracy: 0.8009 - loss: 0.5583 - val_accuracy: 0.8050 - val_loss: 0.5456
Epoch 25/25
48/48 - 26s - 551ms/step - accuracy: 0.8060 - loss: 0.5425 - val_accuracy: 0.8052 - val_loss: 0.5543

And let’s plot the history

plot(history)

The loss curve is noisy but shows a fairly consistent decline. As it has not yet plateaued, it may be worth increasing the number of epochs to train for longer.

Model evaluation

To finish up, we’ll take a look at how well the model can segment an image. As we don’t have a separate testing set, we’ll simply use one of the images from the validation set. The steps are:

  1. Reload the saved model weights
model <- load_model("lulc_segmentation.keras")
  2. Load an image (and mask)
i = 1
test_image <- val_paths$input[i] |>
  tf_read_image_and_resize("jpeg", channels = 3L)

test_mask <- val_paths$target[i] |>
  tf_read_image_and_resize("png", channels = 1L)
  3. Use the model predict function to estimate the probability of each class for each pixel
predicted_mask_probs <-
  model(test_image[tf$newaxis, , , ])
  4. Visualize the prediction (the class with the highest probability)
predicted_mask <-
  tf$argmax(predicted_mask_probs, axis = -1L)
par(mfrow = c(1, 3))
display_image_tensor(test_image)
display_target_tensor(test_mask)
display_target_tensor(predicted_mask)
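To go beyond a visual check, the predicted and true masks can be compared pixel by pixel. This sketch assumes the test_mask and predicted_mask objects created in the steps above:

```r
## Per-pixel comparison of true vs. predicted classes
true_classes <- as.integer(as.array(test_mask))
pred_classes <- as.integer(as.array(predicted_mask))

## Confusion table across the 6 classes, and overall pixel accuracy
table(true = true_classes, predicted = pred_classes)
mean(true_classes == pred_classes)
```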

The resulting segmentation is far from perfect here, but given the size of the input data and the relatively short training period, it is already starting to capture the spatial patterns in this image. The next steps are likely to be:

  • Add the full set of images
  • Train for longer (100 epochs)
  • Add data augmentation